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Abstract: Kalman smoothers reconstruct the state of a dynamical system starting from 
noisy output samples. While the classical estimator relies on quadratic penalization of process 
deviations and measurement errors, extensions that exploit Piecewise Linear Quadratic (PLQ) 
penalties have been recently proposed in the literature. These new formulations include 
smoothers robust with respect to outliers in the data, and smoothers that keep better track 
of fast system dynamics, e.g. jumps in the state values. In addition to L 2 , well known examples 
of PLQ penalties include the L x , Huber and Vapnik losses. In this paper, we use a dual 
representation for PLQ penalties to build a statistical modeling framework and a computational 
theory for Kalman smoothing. 

We develop a statistical framework by establishing conditions required to interpret PLQ 
penalties as negative logs of true probability densities. Then, we present a computational 
framework, based on interior-point methods, that solves the Kalman smoothing problem with 
PLQ penalties and maintains the linear complexity in the size of the time series, just as in the 
L2 case. The framework presented extends the computational efficiency of the Mayne-Fraser 
and Rauch-Tung-Striebel algorithms to a much broader non-smooth setting, and includes many 
known robust and sparse smoothers as special cases. 

Keywords: Piecewise linear quadratic penalties; nonsmooth optimization; ii/Huber/Vapnik 
loss functions; interior point methods 



1. INTRODUCTION 

Consider the following discrete-time linear state-space 
model 

x\ = xq + w\ 

x k = G k x k -i + Wk, k = 2,3,...,N (1.1) 
z k =H k x k + v k , k = l,2,...,N 

where x k £ K™ is the state, Xq is known, z k £ M. m contains 
noisy output samples, G k and H k are known matrices. 
Further, {w k } and {v k } are mutually independent zero- 
mean random variables with covariances given by {Q k } 
and {R k }, respectively. 

The classical fixed-interval Kalman smoothing problem is 
to obtain the (unconditional) minimum variance linear 
estimator of the states {x k }^ =1 as a function of {z k }^ =1 . It 
is well known that the structure of this estimator is related 
to the following optimization problem 

N 

min V \\z k - H k x k \\ 2 1 + \\x k - G k x k -i\\ 2 ! (1.2) 

K— 1 

where G\ denotes the identity matrix and ||a||| := o T Sa 
for every column vector a. When data become available, 
the solution can be computed by the classical Kalman 
smoother with the number of operations linear in N. This 



procedure also provides the minimum variance estimate of 
the states when all the system noises are assumed to be 
Gaussian. 

In many circumstances, linear estimators that rely on 
quadratic penalization of model deviation, such as (1.2), 
lead to unsatisfactory results. For instance, they are not 
robust with respect to the presence of outliers in the 
data [Huber, 1981, Aravkin et al., 2011a, Farahmand 
et al., 2011] and may have difficulties in reconstructing fast 
system dynamics, e.g. jumps in the state values [Ohlsson 
et al., 2011]. In addition, sparsity-promoting regularization 
is often used in order to extract from a large measurement 
or parameter vector a small subset having greatest impact 
on the predictive capability of the estimate for future 
data. This sparsity principle permeates many well known 
techniques in machine learning and signal processing, such 
as feature selection, selective shrinkage, and compressed 
sensing [Hastie and Tibshirani, 1990, Efron et al., 2004, 
Donoho, 2006]. In many circumstances, when smoothing is 
considered, it can be interpreted as a sparse non Gaussian 
prior distribution on the noises entering the system. In 
these cases, the estimator (1.2) is often replaced by 

N 

(z k - H k x k ;R k ) + J (x k - G k Xk-i\Qk) (1-3) 

fe=i 



where, for example, V can be the Huber or the Vapnik's e- 
insensitive loss, used in support vector regression [Vapnik, 
1998, Evgeniou et al., 2000], while J may be the €i-norm, 
as in the LASSO procedure [Tibshirani, 1996]. 
The interpretation of problems such as (1.3) in terms 
of Bayesian estimation has been extensively studied in 
the statistical and machine learning literature in recent 
years and probabilistic approaches used in the analysis of 
estimation and learning algorithms can be found e.g. in 
[Mackay, 1994, Tipping, 2001, Wipf et al., 2011]. Non- 
Gaussian model errors and priors leading to a great va- 
riety of loss and penalty functions are also reviewed in 
[Palmer et al., 2006] using convex-type and integral- type 
variational representations, with the latter being related to 
Gaussian scale mixtures. The fundamental novelty in this 
work is that, rather than taking this approach, we start 
with a particular class of losses, called PLQ penalties, well 
known from optimization literature [Rockafellar and Wets, 
1998]. We establish conditions which allow these losses to 
be viewed as negative logs of true densities, ensuring that 
Wk and Vk in (1.1) come from true distributions. This in 
turn allows us to interpret the solution to the problem 
(1.3) as a MAP estimator when the loss functions V and 
J come from this subclass of PLQ penalties. We will show 
that this subclass includes the four key examples, the 
Li, Huber, and Vapnik penalties. 

The density characterization of PLQ penalties is achieved 
using a dual representation, which also underlies the devel- 
opment of algorithms for fitting models of the form (1.3). 
In particular, in the second part of the paper we derive the 
conditions, complimentary to those needed to set up the 
statistical framework, that allow the development of new 
and computationally efficient Kalman smoothers designed 
using non-smooth penalties on the process and measure- 
ment deviations. Amazingly, it turns out that the interior 
point method used in [Aravkin et al., 2011a] generalizes 
perfectly to the entire class of PLQ densities under a 
simple verifiable non-degeneracy condition. Hence, the so- 
lutions of all the PLQ Kalman smoothers can be computed 
with a number of operations that scales linearly in A, as in 
the quadratic case. Such theoretical foundation generalizes 
the results recently obtained in [Aravkin et al., 201 la, b, 
Farahmand et al., 2011, Ohlsson et al., 2011], framing them 
as particular cases of the framework presented here. 
The paper is organized as follows. In Section 2 we in- 
troduce the class of PLQ convex functions, and provide 
the conditions under which they can be interpreted as 
negative logs of corresponding densities. In Section 3 we 
present a new PLQ Kalman smoother theorem that gen- 
eralizes the well known Mayne-Fraser two-filter and the 
Rauch-Tung-Striebel algorithm [Gelb, 1974] to nonsmooth 
formulations. This theorem is obtained by solving the 
Karush-Kuhn- Tucker (KKT) system for PLQ penalties 
using interior point methods, and exploiting the state 
space structure to obtain the solution in linear time. The 
necessary results and proofs supporting the main theorems 
appear in the Appendix. We end the paper with a few 
concluding remarks. 



2. PIECEWISE LINEAR QUADRATIC PENALTIES 
AND DENSITIES 

2.1 Preliminaries 

We recall a few definitions from convex analysis. 

• (Affine hull) Define the affine hull of any set S, 
denoted by aff S, as the smallest affine set that 
contains S. 

• (Cone) For any set S, denote by cone S the set 
{ts\s eS,t£ R + }. 

• (Polar Cone) For any cone K C R m , the polar of K 
is defined to be 

K° := {v\(v,w) <0Vine K}. 

• (Horizon cone) . The (convex) Horizon cone C°° is the 
set of 'unbounded directions' for C, i.e. d <G C°° if for 
any point w £ C we have {d\w + rd e cl C V r > 0}. 

2.2 PLQ densities 

We now introduce the PLQ penalties and densities that 
are the focus of this paper. 

Definition 2.1. (piecewise linear-quadratic penalties) [Rock- 
afellar and Wets, 1998]. For a nonempty polyhedral set 
U C R rn and a symmetric positive-scmidefinite matrix 
M e R rnxm (possibly M = 0), the function 6 U>M ■ K m -> 
R defined by 

0u,m(w) := sup \(u,w) - \{u,Mu)\ (2.1) 

uEU I Z ) 

is proper, convex, and piecewise linear-quadratic. When 
M = 0, it is piecewise linear; 6jj a — ajj, the support 
function of U . The effective domain of 9u,m, denoted by 
dom(#[/ m), is the set of w € R m where 9u m(w) < 00, and 
is given by (U°° n Null( M ))° . ■ 

In order to capture the full class of penalties of interest, 
we consider injective affine transformations into R m of the 
form b + By. The requirements on B therefore are m> n 
and Null(-B) = {0}. The final technical requirement we 
impose is that b £ dom 6jj,m- 

Definition 2.2. (PLQ penalties with shifts and trans- 
forms) Using (2.1), define p : R" -> R as 6 u>M (b + By): 

Pu,M-b,B (y) ■= sup { (u, b + By) - I (u, Mu) \ (2.2) 
ueu I ■ 

The following result characterizes the effective domain of 
p (see Appendix for proof). 

Theorem 2.3. Let p denote Pu,M,B,b(y), an d K denote 
U°° n Null(M). Suppose U c'lTMs a polyhedral set, 
y e M™, b e K°, M e R rnxm is positive semidefinite, and 
B E R mxn is injective. Then we have (B T K)° C dom(p) 
and (B T (K n —K)) 1 - = aff(dom(p)). ■ 

Note that the functions p are still piecewise linear- 
quadratic. All of the important examples mentioned before 
can be represented in this way, as shown below. 
Remark 2.4-. (scalar examples). The L 2 , l\, Huber, and 
Vapnik penalties are representable in the notation of 
Definition 2.2. 
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V(x) = 


-x - e; x < -t 


l/(x) = 


0; -e < i < e 


V(x) = 


x — e; e < j: 







Fig. 1. Huber (left) and Vapnik (right) Penalties 
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0, and B = 1. We 
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the sup is maximized at u = y, whence p(y) = \y 2 . 

(2) l x : Take J7 = [-1,1], M = 0, 6 = 0, and 5 = 1. 
We obtain p(x) = sup (uy) . The function inside 

tt£[-l,l] 

the sup is maximized by taking u = sign(y), whence 

pO) = \y\- 

(3) Huber: Take U = [-K, K], M = 1, b = 0, and 5 = 1. 

1 



We obtain p(y) 



uy 



Take the 



1 K 2 

2 i\ . 



sup 

derivative with respect to u and consider the following 
cases: 

(a) If y < —K, take u = —K to obtain — Ky — 

(b) If —K <y<K, take u = y to obtain \y 2 . 

(c) If y > K, take u = K to obtain a contribution of 
Ky-\K 2 . 

This is the Huber penalty with parameter K, shown 
in the left panel of Fig. 1. 
(4) Vapnik: take U = [0, 1] x [0, 1], M = [g g], B 

and b — [ ~* ] , for some e > 0. We obtain p(y) 



V 



SU P M1 , U2 G[0,1] 



y- e 



We can obtain an 



explicit representation by considering three cases: 

(a) If \y\ < e, take u\ = U2 = 0. Then p(y) = 0. 

(b) If y > e, take u\ = 1 and U2 = 0. Then p(y) = y — 
e. 

(c) If y < — e, take Hi = and it 2 = 1. Then 

_ p(y) = -y- £■ 

This is the Vapnik penalty with parameter e, shown 
in the right panel of Fig. 1. 

Note that the affine generalization (Definition 2.2) is 
already needed to express the Vapnik penalty. ■ 

In order to characterize PLQ penalties as negative logs of 
density functions, we need to ensure the integrability of 
said density functions. A function p(x) is called coercive if 
lim|| 2: ||_ ! . 0O p{x) — 00, and coercivity turns out to be the key 
property to ensure integrability. The proof of this fact, and 
the characterization of coercivity for PLQ penalties using 
the notation of Def. 2.2, are the subject of the next two 
theorems (see Appendix for proofs). 

Theorem 2.5. (PLQ Integrability). Suppose p(y) is coer- 
cive, and let n a g denote the dimension of aff (dom p). Then 
the function f(y) = exp(—p(y)) is integrable on aff (dom p) 
with the n a ff-dimensional Lebesgue measure. ■ 



Theorem 2.6. (Coercivity of p). p is coercive if and only if 
[B T cone(C/)]° = {0}. ■ 

Theorem 2.6 can be used to show the coercivity of familiar 
penalties. 

Corollary 2.7. The penalties L2, L\, Vapnik, and Huber 
are all coercive. 

Proof: We show all of these penalties satisfy the hypoth- 
esis of Theorem 2.6. 

(1) L 2 : U = R and B = 1, so [B T cone(f7)]° = R° = {0}. 

(2) Ir- U = [-1, 1], so cone(f/) = R, and B = 1, so proof 
reduces to that case 1. 

(3) Huber: U = [-K, K], so cone(£7) = R, and B = 1, so 
proof reduces to that of case 1. 

(4) Vapnik: U = [0,1] x [0,1], so cone{U) — R^_. B = 
[ -1 ] , so -B T cone(£7) = R, and again we reduce to 
case 1. ■ 

We now define a family of distributions on K™ by inter- 
preting piecewise linear quadratic functions p as negative 
logs of corresponding densities. Note that the support 
of the distributions is always contained in the affine set 
aff (dom p), characterized in Th. 2.3. 

Definition 2.8. (Piecewise linear quadratic densities). Let 
p be any coercive piecewise linear quadratic function on 
W 1 of the form pu,M,B,b-,(y) = Qu,mQ> + By). Define p(y) 
to be the following density on W 1 : 

' c{ x exp(-p(y)) y E dom p 




p(y) = 



where 



Cl 



(I 



else, 
exp(-p(y))dy 



(2.3) 



' yfEdom p 

and integral is with respect to the Lebesgue measure with 
dimension dim^aff(dom p) 

PLQ densities are true densities on the affine hull of the 
domain of p. The proof of Theorem 2.5 can be easily 
adapted to show that they have moments of all orders. 

3. KALMAN SMOOTHING WITH PLQ PENALTIES 

In this section, we consider the model (1.1), but in the 
general case where errors Wk and Vk can come from any 
of the densities introduced in the previous section. To this 
end, we first formulate the KS problem over the entire 
sequence {xk}. 

Given a sequence of column vectors {uk} and matrices 
{Tfc} we use the notation 

rTi 



vec({u fc }) 



UN. 



, diag({T fe }) 



Tn 











■■ 

T N . 



We make the following definitions. 

x = vec{xi, ■ • • ,xn} , w = vec{wi, • • • , wk } 

v = vee{i>i, ■■■ ,v k } , Q = diag{Qi, • • • 7 Q N } 

R = diag{i?i, • • • ,R N } , H = diag{i?i, • • • ,H N }. 

We also introduce the matrices G and H: 



G = 



■ I 
-G 2 I 



'•• 

-G N I. 



H = 



r^i o 

o H 2 



'■■ 

H N . 



With this notation, model (1.1) can be written 
x = Gx + w 



z = Hx + v 



(3.1) 



where x G 



iN 



is the entire state sequence of interest, 



w is corresponding process noise, z is the vector of all 
measurements, v is the measurement noise, and x is a 
vector of size nN with the first n-block equal to xo, the 
initial state estimate, and the other blocks set to 0. 

The general Kalman smoothing problem is described in 
the following proposition. 

Proposition 3.1. Suppose that the noises w and v in the 
model (3.1) are PLQ densities with means 0, variances Q 
and R (see Def. 2.8). Then, for suitable U w ,M w 1 b w 1 B w 
and U v 1 M v 1 b v 1 B v we have 

p( w ) oc exp(-0c / »,M»(&"' + B W Q- X ' 2 w)) 
p(v) oc exp(-0u*,M"(P v +B v R- 1 ' 2 v)) 
while the MAP estimator of x in the model (3.1) is 
e u » M *,{V a + B w Q-V\Gx-x )) \ 



(3.2) 



arg mm x , 

^K" N 1 +6u« iM *(b v + B v R~ 1 / 2 (Hx- z)) 



(3.3) 



Note that since Wk and vu are independent, problem (3.3) 
is decomposable into a sum of terms analogous to (1.2). 
This special structure is manifest in the block diagonal 
structure of H,Q,R,B V ,B W , the bidiagonal structure of 
G, and the structure of sets U w and U v , and is key in 
proving the linear complexity result that will be derived 
in the next part of this section. 

For our purposes, it is now important to recall that, 
when the sets U w and U v are polyhedral, (3.3) is an 
Extended Linear Quadratic program (ELQP), described 
in [Rockafellar and Wets, 1998, Example 11.43]. Rather 
than directly solving (3.3), we work with the Karush- 
Kuhn- Tucker (KKT) system. We present the system in 
the following lemma, and derive it in the Appendix. 

Lemma 3.2. Suppose that the sets U w and U v are poly- 
hedral, i.e. can be written 

U w = {u\(A w ) T u < a w }, U v = {u\(A v ) T u < a v } . 

Then the necessary first-order conditions for optimality 
of (3.3) are given by 



= {A w ) T u w + s v 
= (s w ) T q w ; 



= (A v ) T u v + s v 
= (s w ) V 



= b w +B w Q- 1 l 2 Gd 
= 6"- B v R- 1/2 Hd- M v u 



M w-w _ A w qW 



= G T Q- T / 2 (B W )' ] 
< s w ,s v ,q w ,q v . 



H T R- T ' 2 (B V )' 1 



(3.4) 



Our approach is to solve (3.4) directly using Interior Point 
(IP) methods. IP methods work by applying a damped 



Newton iteration to a relaxed version of (3.4), specifically 
relaxing the 'complementarity conditions': 

(s w ) T q w = -> Q W S W 1 -fll = 
{s v ) T q v =0 -> Q V S V 1-^1 = 0, 

where Q w ,S W 7 Q V 7 S V are diagonal matrices with diagonals 
q w ,s w ,q v , s v respectively. The parameter \x is aggressively 
decreased to as the IP iterations proceed. Typically, 
no more than 10 or 20 iterations of the relaxed system 
are required to obtain a solution of (3.4), and hence an 
optimal solution to (3.3). The following theorem is key 
and represents the main result of this section. It shows 
that the computational effort required (per IP iteration) 
is linear in the number of time steps whatever PLQ density 
enters the state space model. 

Theorem 3.3. (PLQ Kalman Smoother Theorem) Suppose 
that all Wk and Vk in the Kalman smoothing model 
(1.1) come from PLQ densities that satisfy Null(M) n 
U°° = {0}, i.e. their corresponding penalties are finite- 
valued. Then (3.3) can be solved using an IP method, with 
computational complexity 0(Nn 3 + Nm) time. ■ 

The proof is presented in the Appendix and shows that IP 
methods for solving (3.3) preserve the key block tridiago- 
nal structure of the standard smoother. General smoothing 
estimates can therefore be computed in 0(Nn 3 ) time, as 
long as the number of IP iterations is fixed (as it usually 
is in practice, to 10 or 20). 

It is important to observe that the motivating examples 
(see Remark 2.4) all satisfy the conditions of Theorem 3.3. 

Corollary 3.4- The densities corresponding to L , L 2 , Hu- 
ber, and Vapnik penalties all satisfy the hypotheses of 
Theorem 3.3. 

Proof: We verify that Null(M) n Null(A T ) = for each 
of the four penalties. In the L 2 case, M has full rank. For 
the L 1 , Huber, and Vapnik penalties, the respective sets 
U are bounded, so U°° = {0}. 

4. CONCLUSIONS 

We have presented a new theory for robust and sparse 
Kalman smoothing using nonsmooth PLQ penalties ap- 
plied to process and measurement deviations. These 
smoothers can be designed within a statistical framework 
obtained by viewing PLQ penalties as negative logs of 
true probability densities, and we have presented necessary 
conditions that allow this interpretation. In this regard, 
the coercivity condition characterized in Th. 2.6 has been 
shown to play a central role. Notice that such a condition 
is also a nice example of how the statistical framework 
established in the first part of this paper gives an alter- 
native viewpoint for an idea useful in machine learning. 
In fact, coercivity is also a fundamental prerequisite in 
sparse and robust estimation as it precludes directions 
for which the loss and the regularizer are insensitive to 
large parameter/state changes. Thus, the condition for 
a (PLQ) penalty to be a negative log of a true density 
also ensures that the problem is well posed and that the 
learning machine/smoother can control model complexity. 
In the second part of the paper, we have shown that 
solutions to PLQ Kalman smoothing formulations can be 
computed with a number of operations that is linear in 
the length of the time series, as in the quadratic case. 



A sufficient condition for the successful execution of IP 
iterations is that the PLQ penalties used should be finite 
valued, which implies non-degeneracy of the corresponding 
statistical distribution (the support cannot be contained 
in a lower-dimensional subspace). The statistical inter- 
pretation is thus strongly linked to the computational 
procedure. 

The computational framework presented allows a broad 
application of interior point methods to a wide class of 
smoothing problems of interest to practitioners. The pow- 
erful algorithmic scheme designed here, together with the 
breadth and significance of the new statistical framework 
presented, underscores the practical utility and flexibility 
of this approach. We believe that this perspective on model 
development and Kalman smoothing will be useful in a 
number of applications in the years ahead. 
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APPENDIX 

Preliminaries 

Definition 4-1- (Horizon cone, specialized to the convex 
setting by [Rockafellar and Wets, 1998, Theorem 3.6]). The 
Horizon cone C°° for a convex set C is convex, and for any 
point w € C consists of the vectors {d\ w + rd e cl C V t > 
0}. 

Definition 4-2. (Lineality). Define the lineality of convex 
cone K, denoted \in(K), to be K(l— K. Since K is a convex 
cone, \in(K) is the largest subspace contained in K. 
Lemma 4-3. (Characterization of lineality, [Rockafellar, 
1970, Theorem 14.6]). Let K be any closed set containing 
the origin. Then lm(K) = (K ) 1 - . 

Definition 4-4- (Affine hull). Define the affinc hull of any 
set S, denoted by aff S, as the smallest affine set that 
contains S. 

Corollary 4-5. (Characterization of aff K°) Taking the 
perp of the characterization in Lemma 4.3, the affinc 
hull of the polar of a closed convex cone K is given by 
aff K° = lin(/f)- L . 

Proof of Theorem 2.3 

Lemma 4-6. (Polars, linear transformations, and shifts) 
Let K C K™ be a closed convex cone, b e W L , and 
B e R nxk . Then we have {B T K)° C B- X (K° - b) if 

beK°. 

Proof: Recall that a convex cone is closed under addition. 
Then for any b € K ° , we have K° + b C K° , and hence 
K° C K° - b. By [Rockafellar, 1970, Corollary 16.3.2] we 
get 

(B T K)° = B- X K° C B~ l (K° - b) . 

■ 

Corollary 4-7. Let K be a closed convex cone, and B e 
R nxk . If be K°, then {B T {\m{K)))^ C aff {B~ l {K° -&)). 

Proof: By Lemma 4.6, aff (if - b)) D aS{B T K)° = 
(\m(B T K)) ± where the last equality is by Corollary 4.5. 
Since B isa linear transformation, we have lm(B T K) = 
B T \m(K). 

m 

Lemma 4-8. Let K C K™ be a closed convex cone, b e 
aff(iT)°, and B E R nxk . Then aff(B- 1 (/sT - 6)) C 
B-HliniK)) 1 - C (B-^dSiK - b)). 

Proof: If w € aff (B^ 1 (K° -b)), for some finite N we 
can find sets {AJ C M and {wj C B^ 1 ^ - b) such that 

SiLi = 1 and ^i w i = w - For each Wi, we have 

Bu>i e K° - b, so b + Bw, € K°. Then 

JV 

b + Bw = + Bwi) G aff (K°) = lin(X)- L . 

»=i 



Since b £ lin(if)- L by assumption, we have Bw £ Im(K)- 1 -, 
and so w £ B~ 1 (\ia(K) ± ). 

Next, starting with w £ 1 (lin( J ftT )-*-) we have Bw £ 
Im(K) 1 - and so b + Bw £ Im(K) 1 - since lin(K) is a 
subspace and b £ Im(K)- 1 . Then for some finite N we can 
find sets {Ai} C K and {t>i} C K° such that = 1 

and Kvi = b+Bw. Subtracting b from both sides, we 

have X)t=i Ai(«i — &) = Bw, so in particular £ aff — 
6). Then w £ B^ 1 aS(K° — b). 

m 

Theorem 4-9. Let K C K™ be a closed convex cone, 
6 € K", and B e M" xfe . If b £ K° , then (B T lin(if))- L = 
aff {B- l (K° - b)) = B- l {Ym{K)^). 

Proof: From Corollary 4.7 and Lemma 4.8, we immedi- 
ately have 

{B T Ym{K)) 1 - C aff {B'^K - b)) C B' 1 {\in{K)^) . 

Note that for any subspace S, S 1 - = S° . Then by 
[Rockafellar, 1970, Corollary 16.3.2], (B^iniK)) 1 - = 
B-^liniK)^). 



The proof of Theorem 2.3 now follows from Lemma 4.6 
and Theorem 4.9. 

Proof of Theorem 2. 5 

Using the characterization of a piecewise quadratic func- 
tion from [Rockafellar and Wets, 1998, Definition 10.20], 
the effective domain of p(y) can be represented as the 
union of finitely many polyhedral sets Ui, relative to 
each of which p(y) is given by an expression of the form 
^(y,Aiy) + (a,i,y) + on for some scalar on £ R, vec- 
tor ai £ R n and symmetric positive semidefinite matrix 
Ai £ M™ x ". Since p(y) is coercive, we claim that on each 
unbounded Ui there must be some constants and /3j > 
so that for \\y\\ > iVj we have p{y) > A II 2/ II- Otherwise, we 
can find an index set J such that p(yj) < flj\\yj\\, where 
/3j I and \\yj\\ t oo. Without loss of generality, suppose 
converges to y £ U°° , by [Rockafellar, 1970, Theorem 



8.2]. By assumption, 



ll w 



Ili/ill m '\\\yAV^hA/ ' Y l, \\yj\\/ ' \\yjW 

Taking the limit of both sides over J we see that 

(llsrlf' J ^ i Jy 7 \\'} must conver g e to a finite value. But 
this is only possible if (y, Aiy) = 0, so in particular we 
must have y £ Null(Aj). Note also that (a i} y) < 0, by 
taking the limit over J of 

P(Vj) ^ / Vi \ a 
'llifcl 



A, 



I 0, and we have 
Vi \ , /„ Vo 



> 



+ 



Wi 



so for any xo £ Ui and A > we have xq + Xy £ Ui since 
y £U°° and 

p(x + Xy) < p(x ) +a u 

so in particular p stays bounded as A f oo and cannot be 
coercive. 



The intcgrability of f(y) is now clear. First note that f(y) 
is bounded below by 0. Recall that the effective domain of p 
can be represented as the union of finitely many polyhedral 
sets Ui, and for each unbounded such Ui we have shown 
f(y) < exp[— Albll] off 01 some bounded subset of Ui. 
An elementary application of the bounded convergence 
theorem shows that / must be integrable. 

Proof of Theorem 2. 6 

First observe that [B^ 1 (conc([/)]° = [B T cone(U)]° by 
[Rockafellar, 1970, Corollary 16.3.2]. 

Suppose that y £ B _1 ((cone U)°), and j / 0. Then 
By £ cone(U), and By ^ since B is injective, and we 
have 

p{ry) = sup (b + rBy, u) — -u T Mu 

u£U 2 

= sup (b, u) — -u T Mu + r(By, u) 
ueu 2 

< sup (6, u) — -u T Mu 
ueu 2 



< 



7 U,M 



(b), 



so p(ry) stays bounded even as r — > oo, and so p cannot 
be coercive. 

Conversely, suppose that p is not coercive. Then we can 
find a sequence {yk} with \\yk\\ > k and a constant K so 
that p(yk) < for all k > 0. Without loss of generality, 
we may assume that — > y. 



Then by definition of p, we have for all u £ U 

(b + By k ,u) - \u T Mu < K 
■ 1 



(b + By k , u) <K + ^u T Mu 



b + Byk . , K 



1 



T U T MU 



\Vk\\ \\Vk\\ 2||y fe || 

Note that y 7^ 0, so i?y 7^ 0. When we take the limit as 
fc — > 00, we get (By,u) < 0. From this inequality we see 
that By £ (cone U)° , and so y £ _B _1 ((cone U)°). 



Proof of Lemma 3.2 

The Lagrangian for (3.3) for feasible (x,u w ,u v ) is 
L(x, u w ,u") = 



~u w ~ 




'u w ~ 


T 


M w " 




u w ~ 


u v 


)-\ 


u v 




M v 




u v 



-B W Q^' 2 G 

B v R -l/2 H 



(4.1) 

where b w = b w -B w Q- 1 ' 2 x and b v = b v -B v PC 1 ! 2 z. The 
associated optimality conditions for feasible {x,u w ,u v ) are 
given by 

G T Q- T / 2 (B w ) T u w - H T R- T ' 2 (B v ) T u v = 

b w - M w u w + B w Q- 1 / 2 Gx £ Nun, (u w ) (4.2) 

b v - M v u v - B v R- 1/2 Hx £ Nu,(u v ) , 

where Nc(x) denotes the normal cone to the set C at the 
point x (see Rockafellar [1970] for details). 

Since U w and U v are polyhedral, we can derive ex- 
plicit representations of the normal cones Nu^{u w ) and 



Njjv(u v ). For a polyhedral set U C M. m and any point 
u € U, the normal cone Njj(u) is polyhedral. Indeed, 
relative to any representation 

U = {u\A T u < a} 

and the active index set I(u) := {i\(Ai,u) — a*}, where 
Aj denotes the ith column of A, we have 



N v {u) 



qiAi H h q m A m | q { > for i G J(u) 

<7i = fori ^ 7(u) 



(4.3) 

Using (4.3), Then we may rewrite the optimality condi- 
tions (4.2) more explicitly as 

G T Q- T / 2 (B w ) T u w - H T R- T / 2 (B v ) T u" = 

b w - M w u w + B w Q- 1 ' 2 Gd = A w q w 

b v - M v u v - B v R- l ' 2 Hd = A v q v ( 4 - 4 ) 

{q v > 0\qV = for i I(u v )} 

{q w > 0\qf = for i # I(u w )} 

Define slack variables s w > and s v > as follows: 

s w = a w - (A w ) T u w 
s v = a" - (A v ) T u v . 

Note that we know the entries of qf and q? are zero if and 
only if the corresponding slack variables s\ and sf are 
nonzero, respectively. Then we have (q w ) T s w = (q v ) T s v = 
0. These equations are known as the complementarity con- 
ditions. Together, all of these equations give system (3.4). 



(4.5) 



4.1 Proof of Theorem 3.3 

IP methods apply a damped Newton iteration to find the 
solution of the relaxed KKT system — 0, where 



f sW \ 



W 

This entails solving the system 
f sW \ 



(A w ) T u w + s w - a w 
(A v ) T u v +s v - a v 
D{q™)D{s w )l - M l 
D(q v )D{s v )l-ia 
b w + B w Q- 1 / 2 Gd - M w u w - A w q w 
b v - B v R- 1 ^ 2 Hd - M v u v - A v q v 
G T Q- T / 2 (B w ) T u w - H T R- T / 2 (B v ) T u v 



As w ' 
As v 
Aq w 
Aq v 
Au w 
Au v 
Ad 



( S Z\ 



(4.6) 



where the derivative matrix is given by 



i o 

/ 

Q m 

Q v 













s" 



-A" 








-M w 










-M v 



B Q 



{B ) 



>(B")' 



(4.7) 



We now show the row operations necessary to reduce the 
matrix in (4.7) to upper block triangular form. After 
each operation, we show only the row that was modified. 



row 3 <— row 3 — D(q w ) rowi 
[0 D(s w ) -D{q w )(A w ) T 0] 
row4 <— row 4 — D(q v ) row2 
[0 D(s v ) -D{q v ){A v ) T 0] 
row 5 «- row 5 + A w D{s w y 1 row 3 
[0 —T w B W Q- 1 / 2 G] 
row6 <— row 6 + A v D(s v )~ 1 row 4 
[0 -T v -B V R- 1/2 H] . 
In the above expressions, 

T w := M w + A W (S W )- 1 Q W (A W ) T 
T v := M v +A v (S v y 1 Q v (A v ) T , 

where {S W )- 1 Q W and (S V )- 1 Q V are always full-rank di- 
agonal matrices, since the vectors s w , q w , s v ,q v are always 
strictly positive in IP iterations. The invertibility of T w 
and T v is charachterized in the following lemma. 
Lemma 4-10. (Invertibility of T) Let 9u,m(-) be any PLQ 
penalty on R k , with U = {u A T u < a}. Let T e := M + 

ADA T , where D is any diagonal k x k matrix with positive 
entries on the diagonal. Then Tg is invcrtiblc if and only 
if Null(M) n U°° = {0}, or dom(e UM ) is K fe . 



(4.8) 



Proof: Note that 
Null(M + ADA T ) = {w 

= {w 



w T Mw + w T ADA T w = 0} 
w e Null(M) , w e Null(A T )} 



= Null(M)nNull(A T ). 

The first claim now follows from the fact that U°° = 
Null(yl T ). Recall that the effective domain of 9 is given by 
(Null(M) n U°°)°, and it is immediate from the definition 
of 'polar' that 0° = R fe . ■ 

Remark 4-11- (Block diagonal structure of T in i.d. case) 
Suppose that y is a random vector, y = (yi ■■■ y»), 
where each yi is itself a random vector in M. mi , from 
some PLQ density p(yi) oc exp[—C20[/ i ,M i ((•))]) an< ^ au 
yi are independent. Let Ui = {u : Afu < a^}. Then 
the matrix Tg is given by = M + ADA T where 
M = diag[Mi,--- ,M N ], A = diag[Ai,--- ,A N ], D = 
diag[Di,-- - , Dn], and {A} are diagonal with positive 
entries. Moreover, Tg is block diagonal, with ith diagonal 
block given by M; + AiDiAj. ■ 

Corollary 4-12. (T matrices in the Kalman smoothing 
context) The matrices T w and T v in (4.8) are block diago- 
nal provided that {u>k} and {vk} are independent vectors 
from any PLQ densities. Moreover, if these densities all 
satisfy the characterization in Lemma 4.10, these matrices 
are also invertible. ■ 

We now finish the reduction of F^ to upper block 
triangular form: 

row7 rowy 

( 



G T g- T / 2 (^) T (r-)- 1 Vow 5 



H T R~ T / 2 (B 




| row6 




'I 


(A W ) T 








0/0 





(^) T 





S w - 


Q W (A W ) T 








5" 





Q V (A V ) T 








rpW 





B W Q- X I 2 G 








rpV 


-B V RT 1 / 2 H 














where 

$ = $ G + $ H = G T Q- T/2 {B W ) T (T W )- 1 B W Q- 1 ^ 2 G 
+ H T R- T ^(B V ) T (T V )- 1 B V R- 1 ^ 2 H. 

(4.9) 

Note that <I> is symmetric positive definite. Note also that 
<f> is block tridiagonal, since 

(1) &h is block diagonal. 

(2) Q- T / 2 (B W ) T (T W )- 1 B W Q- 1 / 2 is block diagonal, and 
G is block bidiagonal, hence 3>a is block tridiagonal. 

Solving system (4.6) requires inverting the block diagonal 
matrices T v and T w at each iteration of the damped 
Newton's method, as well as solving an equation of the 
form $Aa; = g. We have already seen that $ is block 
tridiagonal, symmetric, and positive definite, so $A.t = g 
can be solved in 0(Nn 3 ) time using the block tridiagonal 
algorithm in [Bell, 2000]. The remaining four back solves 
required to solve (4.6) can each be done in Q(nN) time. 



